Abstract — The three-dimensional data-driven Anatomic
Gene Expression Atlas of the adult mouse brain consists of
digitized in situ hybridization data for thousands of genes,
co-registered to the Allen Reference Atlas. We propose
quantitative criteria to rank genes as markers of a brain
region, based on the localization of the gene expression and
on its functional fitting to the shape of the region. These
criteria lead to natural generalizations to sets of genes. We
find sets of genes weighted with coefficients of both signs
with almost perfect localization in all major regions of the
left hemisphere of the brain, except the pallidum.
Generalization of the fitting criterion with positivity
constraint provides a lesser improvement of the markers, but
requires sparser sets of genes.

Index Terms — Gene expression, neuroanatomy, optimiza- 
tion, generalized eigenvalue problems. 

I. Introduction: the Anatomic Gene 
Expression Atlas (AGEA) of the adult mouse 
brain 

Neuroanatomy is experiencing a renaissance under the
influence of molecular biology and computational
methods. The Allen Institute has built a three-dimensional
data-driven atlas of the adult C57Bl/6J mouse brain
(see the NeuroBlast User Guide,
http://mouse.brain-map.org/, and [1]-[4])
containing expression data for thousands of genes,
co-registered to an atlas of brain regions, the Allen
Reference Atlas (ARA) [5]. However, there is no general
agreement on the list of brain regions for rodents (see
[6], [7]). Given an anatomical atlas such as the ARA,
it is therefore natural to ask if brain regions can be
recognised in the spatial patterns of gene-expression
data. For a molecular approach to the anatomy of the
hippocampus, see [8]. In the present note we propose
quantitative criteria formalizing the notion of marker
genes for brain regions.

For each gene, an eight-week-old C57Bl/6J male mouse
brain was prepared as fresh-frozen tissue, and expression
data were obtained through the following automated
sequence of operations:

1. Colorimetric in situ hybridization (a coronal section for
Satb2 is shown in Figure 1a);

2. Automatic processing of the resulting images: cell-shaped
objects of size between 10 and 30 microns were
looked for in each image in order to minimize artefacts;

3. Aggregation of the raw pixel data into a single
three-dimensional grid, with voxel side 200 microns
(projections of the result are shown in Figure 1b).

The mouse brain is therefore partitioned into cubic voxels
(the whole brain consists of V = 49,742 voxels). For
every voxel v, the expression energy of the gene g is
defined as a weighted sum of the greyscale-value intensities
I evaluated at the pixels p intersecting the voxel:

$$E(v,g) = \frac{\sum_{p \in v} M(p)\, I(p)}{\sum_{p \in v} 1}, \tag{1}$$

where M(p) is a Boolean mask worked out at step 2 with
value 1 if the gene is expressed at pixel p and 0 if it is not.
A maximal-intensity projection of the gene-expression
energy of Satb2 is shown in Figure 1b. The expression
energy E(v, g) is therefore expected to be proportional 
to the quantity of mRNA of gene g in voxel v (there can 
be saturation of the expression energy at large values, but 
the expression energy is still a monotonic function of the 
total number of molecules of mRNA in the voxel). 
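As an illustration, the expression energy of a single voxel can be sketched in a few lines of Python. This is a minimal sketch and not the Allen Institute pipeline: the function name `expression_energy`, the toy pixel values, and the normalization by the number of pixels intersecting the voxel are assumptions made for the example.

```python
import numpy as np


def expression_energy(intensity, mask):
    """Expression energy of one voxel for one gene.

    intensity: greyscale values I(p) at the pixels p intersecting the voxel.
    mask: Boolean mask M(p), 1 where the gene is detected and 0 elsewhere.
    The division by the pixel count is an assumed normalization.
    """
    intensity = np.asarray(intensity, dtype=float)
    mask = np.asarray(mask, dtype=float)
    return float((mask * intensity).sum() / intensity.size)


# Hypothetical voxel intersected by four pixels, two of them expressing.
energy = expression_energy([0.2, 0.8, 0.5, 0.1], [0, 1, 1, 0])
print(energy)
```

Only the masked pixels contribute to the numerator, so a voxel in which no cell-shaped object is detected has zero energy whatever its raw intensities.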

For numerical applications we focused on a set of 
genes for which sagittal and coronal sections have been 
produced at the Allen Institute. For each of these genes, 
we computed the correlation between sagittal and coronal 
data. Some of these correlations are negative, and we
chose to focus on the three quarters of the genes
(G = 3,041 genes) that make up the top of the distribution
of correlation coefficients. The gene-expression data we
consider therefore consist of a voxel-by-gene matrix E
defined in Equation (1). Moreover, the Allen Reference Atlas is
registered to the same grid as the gene-expression data, 
so that each voxel in the brain is annotated according 
to which region it belongs. The ARA comes with several 
partitions of the brain, of varying coarseness. In particular, 
the left hemisphere is partitioned into 12 disjoint regions 
Fig. 1: ISH-stained coronal slice of brain tissue and
digitized data for Satb2. (a) A coronal section of brain
tissue. Colorimetric ISH gives rise to a blue precipitate
where an mRNA for Satb2 is present. (b) A
maximal-intensity projection of the three-dimensional data
resulting from the co-registration of all coronal sections for
Satb2 to a regular grid, at a resolution of 200 microns.



in the ARA (each of which has one connected component;
see Table I for a list of these regions, and Figures 3b
and 4c for an illustration of the cerebral cortex and the
midbrain, respectively). This partition is referred to as the
Big12 annotation. In the present note we will focus on
this annotation for definiteness.
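The selection of genes by agreement between sagittal and coronal data can be sketched as follows. This is an illustrative sketch only: the text does not specify which correlation coefficient was used, so the Pearson correlation between the two voxel profiles of each gene is an assumption, and the function name `select_consistent_genes` and the toy arrays are hypothetical.

```python
import numpy as np


def select_consistent_genes(coronal, sagittal, keep_fraction=0.75):
    """Keep the genes whose coronal and sagittal profiles agree best.

    coronal, sagittal: voxel-by-gene arrays of expression energies.
    Returns the sorted indices of the kept genes and all correlations.
    """
    c = coronal - coronal.mean(axis=0)
    s = sagittal - sagittal.mean(axis=0)
    # Pearson correlation of each pair of columns (assumed definition).
    corr = (c * s).sum(axis=0) / (
        np.linalg.norm(c, axis=0) * np.linalg.norm(s, axis=0)
    )
    n_keep = int(keep_fraction * corr.size)
    kept = np.argsort(corr)[::-1][:n_keep]
    return np.sort(kept), corr


# Hypothetical data: 3 voxels, 4 genes; gene 1 is anti-correlated.
coronal = np.array([[1.0, 1.0, 3.0, 1.0],
                    [2.0, 2.0, 1.0, 0.0],
                    [3.0, 3.0, 2.0, 1.0]])
sagittal = np.array([[1.0, 3.0, 6.0, 1.0],
                     [2.0, 2.0, 2.0, 0.0],
                     [3.0, 1.0, 5.0, 2.0]])
kept, corr = select_consistent_genes(coronal, sagittal)
print(kept, corr)
```

In this toy example the gene with a negative correlation is the one dropped when the top three quarters are kept.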

For computational purposes, a brain region $\omega$ is therefore
equivalent to a set of row indices in the matrix E, and to a
(normalized) vector $\chi_\omega$ in the V-dimensional voxel space,
where the row indices are the only non-zero entries:

$$\chi_\omega(v) \propto \mathbf{1}(v \in \omega), \qquad \sum_{v=1}^{V} \chi_\omega(v)^2 = 1. \tag{2}$$
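A minimal sketch of this construction, assuming the atlas annotation is available as one label per voxel (the array `labels`, the label strings and the function name `region_vector` are hypothetical):

```python
import numpy as np


def region_vector(labels, region):
    """Unit-norm indicator vector of a region in voxel space."""
    chi = (np.asarray(labels) == region).astype(float)
    return chi / np.linalg.norm(chi)


# Hypothetical annotation of six voxels.
labels = np.array(["COR", "COR", "STR", "COR", "CER", "COR"])
chi_cor = region_vector(labels, "COR")
print(chi_cor)
```

With four voxels in the region, each non-zero entry equals 1/2, so that the squares sum to 1.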



II. Neuroanatomy from gene expression: ranking genes
as markers

A. Ranking genes by localization scores 

Given a brain region $\omega$ of interest, let us define the
localization score of a gene g as the fraction of the (square
of the) $L^2$-norm of its expression energy that is contained
in the region:

$$\lambda_\omega(g) = \frac{\sum_{v \in \omega} E(v,g)^2}{\sum_{v \in \Omega} E(v,g)^2}, \tag{3}$$

where $\Omega$ denotes the whole brain. We chose the $L^2$-norm
because it is easy to generalize to a linear combination
of genes (see next section).
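The localization score of every gene for a given region can be computed in one pass over the voxel-by-gene matrix. A minimal sketch, in which the function name `localization_scores`, the Boolean array `in_region` and the toy matrix are illustrative assumptions:

```python
import numpy as np


def localization_scores(E, in_region):
    """Fraction of the squared L2 norm of each column of E lying in the region.

    E: voxel-by-gene array of expression energies.
    in_region: Boolean array, True for the voxels of the region.
    """
    squared = np.asarray(E, dtype=float) ** 2
    return squared[in_region].sum(axis=0) / squared.sum(axis=0)


# Hypothetical data: 4 voxels, 2 genes; the region is the first two voxels.
E = np.array([[3.0, 1.0],
              [4.0, 1.0],
              [0.0, 1.0],
              [0.0, 1.0]])
in_region = np.array([True, True, False, False])
scores = localization_scores(E, in_region)
print(scores)
```

The first gene is expressed only inside the region and scores 1, while the second, uniformly expressed, scores the volume fraction of the region.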

We computed the localization score of every gene in every
region of the Big12 annotation. These scores induce
a ranking of genes as markers of each brain region. A
perfect marker of the region $\omega$ according to this criterion
would have a score of 1. When going from one region to
another, one has to be careful when comparing the values
of the localization scores: as the volumes of the brain
regions vary across the atlas, the localization scores
$\lambda_\omega$ are biased by the size of the region $\omega$. We need a
reference in order to estimate how good a localization
score is compared to what could be expected for a given
brain region. For a fixed brain region $\omega$ we can use two
references.

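A gene is a better marker of $\omega$ than expected from a
uniform expression if its score $\lambda_\omega(g)$ is larger than the
uniform reference defined as

$$\lambda_\omega^{\mathrm{uniform}} = \frac{\mathrm{Vol}\,\omega}{\mathrm{Vol}\,\Omega}. \tag{4}$$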
A data-driven reference is given by the localization score
of the average gene-expression profile:

$$\lambda_\omega^{\mathrm{average}} = \frac{\sum_{v \in \omega} E^{\mathrm{average}}(v)^2}{\sum_{v \in \Omega} E^{\mathrm{average}}(v)^2}, \tag{5}$$

$$E^{\mathrm{average}}(v) = \frac{1}{G} \sum_{g=1}^{G} E(v,g). \tag{6}$$

A gene is a better marker of $\omega$ than expected from an
average expression if its localization score $\lambda_\omega(g)$ is larger
than $\lambda_\omega^{\mathrm{average}}$.
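The two references, and the percentage of genes above each of them, can be sketched as follows. The helper names and the toy matrix are hypothetical, and the uniform reference is computed as a fraction of voxels, which assumes voxels of equal volume (as is the case for the 200-micron grid).

```python
import numpy as np


def localization_scores(E, in_region):
    """Fraction of the squared L2 norm of each column lying in the region."""
    squared = E ** 2
    return squared[in_region].sum(axis=0) / squared.sum(axis=0)


def reference_scores(E, in_region):
    """Uniform and average references for one region."""
    uniform = in_region.mean()  # Vol(region) / Vol(brain) for equal-size voxels
    average_profile = E.mean(axis=1, keepdims=True)
    average = localization_scores(average_profile, in_region)[0]
    return float(uniform), float(average)


# Hypothetical data: 4 voxels, 2 genes; the region is the first two voxels.
E = np.array([[3.0, 1.0],
              [4.0, 1.0],
              [0.0, 1.0],
              [0.0, 1.0]])
in_region = np.array([True, True, False, False])
scores = localization_scores(E, in_region)
uniform, average = reference_scores(E, in_region)
pct_above_uniform = 100.0 * np.mean(scores > uniform)
pct_above_average = 100.0 * np.mean(scores > average)
print(uniform, average, pct_above_uniform, pct_above_average)
```

The normalization of the average profile by the number of genes does not affect its localization score, since the score is a ratio.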






Each gene corresponds to a column of the matrix E,
which is also a vector in a V-dimensional space. A
marker gene is therefore a gene for which this vector is
closely aligned with $\chi_\omega$. In the next section we propose
two quantitative criteria formalizing this notion.



B. Ranking genes by fitting scores 

The localization score does not take into account the
detailed distribution of the expression energy inside the
region of interest. It is therefore interesting to study
another ranking of genes, one that compares the gene-expression
profiles to characteristic functions of brain regions. Such
Region $\omega$ (abbreviation in       Percentage of genes                       Percentage of genes
the Allen Reference Atlas)       above $\lambda_\omega^{\mathrm{uniform}}$      above $\lambda_\omega^{\mathrm{average}}$

Cerebral cortex (COR)            59                                        26
Olfactory areas (OLF)            41                                        40
Hippocampal region (HIP)         51                                        35
Retrohippocampal reg. (RHP)      53                                        33
Striatum (STR)                   16                                        28
Pallidum (PAL)                    9                                        34
Thalamus (THA)                   20                                        38
Hypothalamus (HYP)               15                                        33
Midbrain (MID)                   13                                        37
Pons (PON)                       20                                        43
Medulla (MED)                    30                                        47
Cerebellum (CER)                 22                                        40

TABLE I: Percentage of a set of 3,041 genes in the
Anatomic Gene Expression Atlas above the uniform and
average references for the regions in the Big12 annotation
of the left hemisphere in the Allen Reference Atlas.
There is no particular relationship between the two columns.



a comparison can be based on the functional distance
between the expression profile and the characteristic
function of the region. Let us choose the $L^2$ distance and
compute the following fitting score for each gene g in a
given region $\omega$:

$$\phi_\omega(g) = 1 - \frac{1}{2} \sum_{v \in \Omega} \left( E_g^{\mathrm{norm}}(v) - \chi_\omega(v) \right)^2, \tag{7}$$

where $E_g^{\mathrm{norm}}$ is the $L^2$-normalized g-th column $E_g$ of the
matrix of gene-expression energies:

$$E_g^{\mathrm{norm}}(v) = \frac{E(v,g)}{\sqrt{\sum_{w \in \Omega} E(w,g)^2}}. \tag{8}$$

It is also useful to consider $E_g$ as a vector in the
V-dimensional voxel space (it is the gene-expression vector
of gene g). Just as in the definition of localization scores,
we could have chosen another norm, but the $L^2$-norm
yields an interesting geometric interpretation of the fitting
score. Expanding the expression of the fitting score in
powers of the gene-expression data yields the cosine of
the angle between the gene-expression vector $E_g$ and the
vector $\chi_\omega$ in voxel space. The fitting score is therefore
very closely related to the notion of co-expression (which
for two genes can be defined as the cosine of the angle
between their expression vectors, a useful quantity for
estimating collective properties of sets of genes [9]).
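The expansion can be written out explicitly. The short derivation below uses only the fact that both $E_g^{\mathrm{norm}}$ and $\chi_\omega$ have unit $L^2$-norm, as stated in Equations (2) and (8); the symbol $\theta_{g,\omega}$ for the angle between the two vectors is notation introduced here for convenience.

```latex
\begin{align*}
\phi_\omega(g)
  &= 1 - \frac{1}{2} \sum_{v \in \Omega}
       \left( E_g^{\mathrm{norm}}(v) - \chi_\omega(v) \right)^2 \\
  &= 1 - \frac{1}{2} \left(
       \sum_{v \in \Omega} E_g^{\mathrm{norm}}(v)^2
       + \sum_{v \in \Omega} \chi_\omega(v)^2
       - 2 \sum_{v \in \Omega} E_g^{\mathrm{norm}}(v)\, \chi_\omega(v)
     \right) \\
  &= 1 - \frac{1}{2} \left( 1 + 1
       - 2 \sum_{v \in \Omega} E_g^{\mathrm{norm}}(v)\, \chi_\omega(v) \right) \\
  &= \sum_{v \in \Omega} E_g^{\mathrm{norm}}(v)\, \chi_\omega(v)
   = \cos \theta_{g,\omega}.
\end{align*}
```

Since expression energies are non-negative, this cosine lies between 0 and 1.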

A perfect marker of the region $\omega$ would be a gene with
fitting score equal to 1.
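A minimal sketch of the fitting score, checked against the cosine between each normalized gene-expression vector and the normalized indicator of the region. The function names and the toy data are hypothetical; the data are chosen so that the two criteria disagree on which gene is the better marker.

```python
import numpy as np


def localization_scores(E, in_region):
    """Fraction of the squared L2 norm of each column lying in the region."""
    squared = E ** 2
    return squared[in_region].sum(axis=0) / squared.sum(axis=0)


def fitting_scores(E, in_region):
    """Fitting score of every gene (column of E) for one region."""
    E_norm = E / np.linalg.norm(E, axis=0)      # unit-norm columns
    chi = in_region.astype(float)
    chi /= np.linalg.norm(chi)                  # unit-norm indicator
    return 1.0 - 0.5 * ((E_norm - chi[:, None]) ** 2).sum(axis=0)


# Hypothetical data: 4 voxels, 2 genes; the region is the first two voxels.
E = np.array([[5.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.3],
              [0.0, 0.0]])
in_region = np.array([True, True, False, False])

phi = fitting_scores(E, in_region)
loc = localization_scores(E, in_region)

# Cosine of the angle between each gene vector and the region vector.
chi = in_region / np.sqrt(in_region.sum())
cosines = (E / np.linalg.norm(E, axis=0)).T @ chi
print(phi, cosines, loc)
```

The first gene is perfectly localized but unevenly expressed inside the region, so it has the higher localization score and the lower fitting score; the second gene leaks slightly outside the region but matches its shape more closely.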

There are conflicts between the two induced rankings of 




Fig. 2: Localization scores of the best markers of each
of the brain regions in the Big12 annotation. The k-th
column contains the localization scores of the best marker
of the k-th region, hence the diagonal look of the figure.
Gabra6, the best marker of the cerebellum, is the gene
that maximizes the localization scores across all regions,
at 98.5 percent.

genes: for instance, Satb2 has the highest fitting score for
the cortex (and indeed by the look of Figure 1 it is a
good marker of the cortex), whereas it is ranked 8th by
the localization scores, with $\lambda_{\mathrm{cortex}} = 0.9345$. On the
other hand, Pak7 is ranked first by localization score,
and 7th by fitting score. See Figure 5 for a plot of
best fitting and localization scores in the regions of the
Big12 annotation. Pallidum is the region for which the
best fitting and localization scores are the lowest, and
cerebellum is the one for which they are the highest.

III. Sets of genes as markers 

A. Generalized localization and generalized eigenvectors 

Looking at the scores of the top marker genes for
each brain region, it appears that Gabra6 maximizes
localization scores across all brain regions and all genes,
whereas the best marker in pallidum is the hardest to
separate from other brain regions. However, comparing
the numbers of genes localized above the average and
uniform reference values, as in Table I, does not show
any particular ranking of brain regions.
In order to find better markers, consider a linear
superposition of expression energies in our dataset:

$$E_\alpha(v) := \sum_{g=1}^{G} \alpha_g E(v,g), \tag{9}$$

where G = 3,041 is the number of genes in our dataset.
The localization score in the brain region $\omega$ of a weighted
set of genes encoded by Equation (9) is naturally written
as

$$\lambda_\omega(\alpha) = \frac{\sum_{v \in \omega} \left( \sum_{g} \alpha_g E(v,g) \right)^2}{\sum_{v \in \Omega} \left( \sum_{g} \alpha_g E(v,g) \right)^2} = \frac{\alpha^t J^\omega \alpha}{\alpha^t J^\Omega \alpha}, \tag{10}$$

where the quadratic forms $J^\omega$ and $J^\Omega$ have coefficients
given respectively by scalar products of the projections
of gene-expression vectors on $\omega$ and the whole brain:

$$J^\omega_{gh} = \sum_{v \in \omega} E(v,g) E(v,h), \qquad J^\Omega_{gh} = \sum_{v \in \Omega} E(v,g) E(v,h). \tag{11}$$

The (generalized) localization score $\lambda_\omega(\alpha)$ is invariant
under multiplication of the vector $\alpha$ by a non-zero scalar.
We can fix this dilation invariance by fixing the value of the
denominator in Equation (10). Maximization of the localization
score therefore boils down to a maximization of the quadratic
form $J^\omega$ under a quadratic constraint:

$$\max_{\alpha \in \mathbf{R}^G} \lambda_\omega(\alpha) = \max_{\alpha \in \mathbf{R}^G,\; \alpha^t J^\Omega \alpha = 1} \alpha^t J^\omega \alpha. \tag{12}$$

Introducing the Lagrange multiplier $\sigma$ associated to the
constraint, we are led to the maximization of the quadratic
quantity

$$L^\omega_\sigma(\alpha) = \alpha^t J^\omega \alpha - \sigma \left( \alpha^t J^\Omega \alpha - 1 \right). \tag{13}$$

The stationarity condition of $L^\omega_\sigma$ with respect to the
vector $\alpha$ yields a generalized eigenvalue problem,

$$J^\omega \alpha = \sigma J^\Omega \alpha, \tag{14}$$

and the maximum value of the generalized localization
score is the largest generalized eigenvalue, while the
associated generalized eigenvector contains the set of
weights for genes in the best-localized superposition.
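The generalized eigenvalue problem can be sketched numerically as follows. This is an illustration under stated assumptions, not the code used for the results: it assumes that the quadratic form over the whole brain is positive definite (which requires linearly independent gene-expression vectors), it reduces the problem to a standard symmetric one through a Cholesky factorization, and the function name `best_localized_combination` and the toy data are hypothetical.

```python
import numpy as np


def best_localized_combination(E, in_region):
    """Largest generalized eigenvalue of J_region a = s J_brain a, and the
    associated weights a, normalized so that a^t J_brain a = 1."""
    J_region = E[in_region].T @ E[in_region]   # scalar products inside the region
    J_brain = E.T @ E                          # scalar products over the whole brain
    # With J_brain = L L^T, the problem becomes the standard symmetric problem
    # (L^-1 J_region L^-T) y = s y, with a = L^-T y.
    L = np.linalg.cholesky(J_brain)
    X = np.linalg.solve(L, J_region)           # L^-1 J_region
    M = np.linalg.solve(L, X.T)                # L^-1 J_region L^-T
    eigenvalues, eigenvectors = np.linalg.eigh(M)   # ascending eigenvalues
    weights = np.linalg.solve(L.T, eigenvectors[:, -1])
    return float(eigenvalues[-1]), weights


# Hypothetical data: 3 voxels, 2 genes; the region is the first voxel only.
E = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
in_region = np.array([True, False, False])
score, weights = best_localized_combination(E, in_region)

# Localization score of the combination, recomputed directly from the data.
combination = E @ weights
direct = (combination[in_region] ** 2).sum() / (combination ** 2).sum()
print(score, weights, direct)
```

In this toy example the best single gene has a localization score of 1/2, whereas the best combination reaches 2/3 by giving the second gene a weight of opposite sign, which cancels part of the expression of the first gene outside the region.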

The alternating signs of the coefficients make these 
sets difficult to interpret in terms of transcriptional 
activity, and the plot of the sorted coefficients of the 
generalized eigenvector for the cerebral cortex in Figure 
3c shows that the solutions are not sparse. But these 
algebraic solutions provide absolute bests that one could 
not beat by taking combinations of genes with positive 
coefficients. The negative coefficients make it possible to
offset the contribution of some genes outside the region of
interest.



B. Generalized fitting scores and sets of genes weighted 
by positive coefficients 

Considering again a linear combination of gene-expression
vectors, as in Equation (9), but weighted by
positive coefficients, it is natural to propose the following
fitting score, which is based on the (square of the) $L^2$
Fig. 3: The best set of genes as a generalized eigenvector
for the cerebral cortex. (a) A maximal-intensity
projection of the linear combination of the genes in
the Anatomic Gene Expression Atlas corresponding to the
generalized eigenvector that maximizes the localization
in the cerebral cortex. (b) A maximal-intensity projection
of the characteristic function of the cerebral cortex. (c)
A plot of the sorted coefficients of the genes in the
generalized eigenvector. The localization score in the
cortex is 0.9994. Pak7 is at the second rank by its
coefficient in the generalized eigenvector, while Satb2 is
only at the 64th rank.



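distance between the normalized sum of the expression
energies of all genes in the set, and the characteristic
function of the region of interest:

$$\phi_\omega(\alpha) = 1 - \frac{1}{2} \sum_{v \in \Omega} \left( E_\alpha^{\mathrm{norm}}(v) - \chi_\omega(v) \right)^2. \tag{15}$$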

Fig. 4: Best markers by fitting for the midbrain. (a) The
best single gene is Slc7a6; (b) the best set of genes
consists of 8 genes (Slc17a6, Ephb1, Sema3f, Glra3,
Nova1, Tcf7l2, Ddc, Chrna6), at
$\Lambda = 0.01 \times \Lambda^{\max}_{\mathrm{midbrain}}$;
(c) projection of the midbrain.

A generalization of the fitting criterion to sets of genes
that solves both the sign problem and the sparsity problem
is found quite naturally in terms of an $L^2$-$L^1$
minimization. The following function penalizes the $L^2$ error
function of Equation (15) by the $L^1$ norm of the vector:

$$\mathrm{ErrFit}^{L^1\text{-}L^2}_{\omega,\Lambda}(\{\alpha\}) = \left\| E_\alpha^{\mathrm{norm}} - \chi_\omega \right\|^2_{L^2} + \Lambda \|\alpha\|_{L^1}, \tag{16}$$

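which can be minimized with respect to the weights of the
genes [10] using Matlab code by K. Koh:

$$\alpha_\omega^\Lambda = \operatorname{argmin}_{\alpha \in \mathbf{R}_+^G} \mathrm{ErrFit}^{L^1\text{-}L^2}_{\omega,\Lambda}(\{\alpha\}). \tag{17}$$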



The range of parameter $\Lambda$ to be studied can be restricted
to $[0, \Lambda_\omega^{\max}]$, where



A." 



(18) 



because for larger values of $\Lambda$, the quadratic form
$\mathrm{ErrFit}^{L^1\text{-}L^2}_{\omega,\Lambda}$ is bounded from below by the squared norm
of the vector $E_\alpha$, and the solution to the problem of
Equation (17) is trivially zero. The best fitting score is
generally a decreasing function of $\Lambda$, while the sparsity
grows with $\Lambda$. For each region $\omega$ in the Big12
annotation, there is a domain of $[0, \Lambda_\omega^{\max}]$ for which the
generalized fitting score $\phi_\omega(\alpha_\omega^\Lambda)$ is larger than the best
fitting score of a single gene (scores are plotted in Figure
5 for $\Lambda = 0.01\, \Lambda_\omega^{\max}$).
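The trade-off between fitting and sparsity can be illustrated with a small numerical sketch. Several assumptions are made here and should not be read as the method behind the reported results: the combination is taken to be linear in the weights, with unit-norm gene-expression vectors as columns of a matrix `A`; the solver is a plain projected proximal-gradient iteration written for this example, not the Matlab interior-point code by K. Koh; and the threshold `lam_max` is the value above which the minimizer of this assumed linear problem is zero, as follows from its optimality conditions. All names and data are hypothetical.

```python
import numpy as np


def sparse_positive_fit(A, chi, lam, n_iter=2000):
    """Minimize ||A @ alpha - chi||^2 + lam * sum(alpha) subject to alpha >= 0
    by projected proximal-gradient iterations."""
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)   # inverse Lipschitz constant
    alpha = np.zeros(A.shape[1])
    for _ in range(n_iter):
        gradient = 2.0 * (A.T @ (A @ alpha - chi))
        # Gradient step, shift by the L1 penalty, then projection on alpha >= 0.
        alpha = np.maximum(alpha - step * (gradient + lam), 0.0)
    return alpha


# Hypothetical data: 4 voxels, region = first two voxels, 3 unit-norm genes.
chi = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, np.sqrt(0.5)],
              [0.0, 0.0, np.sqrt(0.5)]])

lam_max = 2.0 * np.max(A.T @ chi)      # above this value the minimizer is zero
alpha = sparse_positive_fit(A, chi, 0.01 * lam_max)

combination = A @ alpha
set_score = combination @ chi / np.linalg.norm(combination)   # cosine with chi
best_single_score = np.max(A.T @ chi)
print(alpha, set_score, best_single_score)
```

In this toy example the penalized fit selects the two genes expressed inside the region and discards the third, and the fitting score of the resulting set exceeds that of the best single gene.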
Fig. 5: (a) Best fitting scores and localization scores of
single genes and of best sets of genes for the brain regions
of the Big12 annotation. The brain regions are sorted by
the best fitting score of single genes (see Table I for the
abbreviations of the brain regions in the Allen Reference
Atlas). The largest improvement to fitting brought by
considering sets of genes rather than single genes is in the
midbrain. (b, c) Table of genes with highest fitting scores,
and numbers of genes contributing to the best-fitted sets
of genes.



IV. Conclusions 

Quantitative methods used to rank single genes as 
markers of brain regions can efficiently spot genes whose 
expression profile outlines a brain region of interest. In 
particular, the generalized localization score can yield 
almost perfectly localized gene expression for all the 
major brain regions except pallidum, at the price of 
involving thousands of genes, weighted by coefficients 
of both signs. The fitting criterion can be generalized
to sparse sets of genes with positive coefficients, even
though the improvement of the scores is less spectacular.
The complexity of the taxonomy of cell types, and the
precise anatomical localization in the brain of some of
these cell types, indicate that there must be combinations
of large numbers of genes with positive coefficients,
corresponding to the superposition of genes given by
Equation (9), that mark some brain regions, quite possibly
much smaller than the large compartments of the left
hemisphere we considered here [11]-[16].
The quantitative criteria used to define marker genes in 
the present paper are all global in nature, since they all 
involve comparison of gene-expression vectors to brain 
regions in terms of the entire voxel space. This does 
not make use of the fact that the voxels belonging to 
the same region of the ARA form connected sets of 
the left hemisphere. One can make these methods more
local [17] and look for genes that are aligned to the
projection of a brain region to a subspace of voxel space
that surrounds the region. Such a set of voxels can be
computed using level-sets of the eikonal distance to the
boundary of the region [18], [19]. The eikonal distance is
also a useful geometric tool for the registration of mouse
skulls to a reference skull, which is currently used in a
high-throughput neuroanatomy project [20], [21]. Genes
separating brain regions from their environment without
being particularly well localized or fitted globally are
shown in [17]. Moreover, the conservation properties of
marker genes (or their lack of conservation properties) 
when going from the mouse atlas to molecular atlases 
for other species will be relevant to the study of brain
evolution [22].